Section: New Results

Improving Characterization of NUMA Architectures through Applications' Kernels

Participants : Philippe Virouleau, François Broquedis, Thierry Gautier, Julien Langou [UCD, USA], Fabrice Rastello.

Programmers need tools to study the behavior of their applications. When targeting NUMA architectures, many existing tools make it possible to observe an application and identify its critical parts. However, there is a need for tools that help programmers understand precisely how those critical parts behave, and how they could be improved on a given architecture.

In the context of data-flow applications, each part of the application (a task) is clearly identified in the data-flow graph. All manipulated data are also explicitly available since, within such a framework, they constitute the links between tasks.
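
The report does not tie this to one programming model; as a minimal, hypothetical illustration, here is how such a data-flow task graph can be expressed with OpenMP depend clauses, one common way of writing these applications (the variable names are purely illustrative):

#include <stdio.h>

#define N 4

int main(void)
{
    double a[N], b[N], c[N];

    #pragma omp parallel
    #pragma omp single
    {
        /* Each task declares the data it reads and writes; the runtime
         * derives the data-flow graph from these dependences. */
        #pragma omp task depend(out: a)
        for (int i = 0; i < N; i++) a[i] = i;

        #pragma omp task depend(out: b)
        for (int i = 0; i < N; i++) b[i] = 2.0 * i;

        /* This task becomes ready only once both producers complete. */
        #pragma omp task depend(in: a, b) depend(out: c)
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    printf("c[%d] = %g\n", N - 1, c[N - 1]);
    return 0;
}

Here the data (a, b, c) are exactly what links the producer tasks to the consumer, as described above.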

On NUMA architectures, a task's execution time depends, among other factors, on both the core that executes the task and the NUMA node on which its data has been allocated. Assume one can characterize a task's behavior with regard to its execution context as follows: run it in isolation from the rest of the application, and vary some of its properties (such as the size of its input or the placement of its data). The scheduler of a runtime system can then use this characterization to improve overall performance: since it has full knowledge of what is currently running on the machine (for instance, on the same NUMA node as an idle thread), it can sort the tasks ready for execution according to how well each would behave on that idle thread, given the current state.
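
As a sketch of what such a characterization run might look like, the following C fragment times one kernel in isolation while varying both the NUMA node holding its data and the node executing it, using libnuma (kernel_run is a hypothetical stand-in for the kernel under study; link with -lnuma):

#include <numa.h>
#include <stdio.h>
#include <time.h>

void kernel_run(double *data, size_t n); /* hypothetical kernel under study */

static double elapsed_s(struct timespec t0, struct timespec t1)
{
    return (double)(t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

void characterize(size_t n)
{
    if (numa_available() < 0)
        return;
    int nodes = numa_max_node() + 1;

    for (int data_node = 0; data_node < nodes; data_node++) {
        for (int exec_node = 0; exec_node < nodes; exec_node++) {
            /* Place the data on data_node, run on exec_node. */
            double *data = numa_alloc_onnode(n * sizeof *data, data_node);
            numa_run_on_node(exec_node);
            for (size_t i = 0; i < n; i++)
                data[i] = 0.0; /* touch pages; the policy keeps them on data_node */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            kernel_run(data, n);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            printf("data@%d exec@%d size=%zu time=%.6f s\n",
                   data_node, exec_node, n, elapsed_s(t0, t1));
            numa_free(data, n * sizeof *data);
        }
    }
}

Sweeping the full (data node, execution node, input size) grid by hand quickly becomes tedious; this is exactly the kind of scenario the tool described below automates.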

We designed a tool whose goal is to help the user execute a given scenario on the architecture. Such a scenario typically describes which kernels to run, the size and placement of their data, and the concurrent workload running alongside them.

The tool guarantees that the scenario will be executed correctly on the architecture, letting users focus on understanding their application rather than on low-level implementation details.
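
To give a flavor of the low-level details being abstracted away (the tool's actual mechanism is not described here; this sketch is an assumption for illustration), binding the calling thread to a given core with hwloc looks like this:

#include <hwloc.h>

/* Pin the calling thread to core core_idx; returns 0 on success.
 * Link with -lhwloc. */
int bind_to_core(int core_idx)
{
    hwloc_topology_t topo;
    if (hwloc_topology_init(&topo) < 0)
        return -1;
    if (hwloc_topology_load(topo) < 0) {
        hwloc_topology_destroy(topo);
        return -1;
    }

    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE,
                                             (unsigned)core_idx);
    int rc = core ? hwloc_set_cpubind(topo, core->cpuset,
                                      HWLOC_CPUBIND_THREAD)
                  : -1;

    hwloc_topology_destroy(topo);
    return rc;
}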

We applied this approach to a dense linear algebra algorithm: the Cholesky factorization. It enabled us to profile the four kernels of the application by running them in various configurations of data placement, input size, and concurrent workload. We believe we have tested enough configurations to reliably identify the best and worst cases for all the kernels. Assuming the kernels behave the same within the full application, we were able to derive upper and lower bounds on the execution time of the overall application from those best and worst cases.
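
For reference, a tiled Cholesky factorization is classically decomposed into four tile kernels, POTRF, TRSM, SYRK, and GEMM (the report does not name them, so this decomposition is assumed here). The sketch below shows how its task graph can be written with OpenMP dependences, using hypothetical wrappers (potrf_tile, etc.) around the corresponding BLAS/LAPACK routines:

#define NT 8 /* number of tile rows/columns, illustrative */

/* Hypothetical wrappers around the BLAS/LAPACK tile kernels. */
void potrf_tile(double *akk);
void trsm_tile(const double *akk, double *aik);
void syrk_tile(const double *aik, double *aii);
void gemm_tile(const double *aik, const double *ajk, double *aij);

/* A[i][j] points to the tile in block row i, block column j. */
void cholesky_tiled(double *A[NT][NT])
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k])
        potrf_tile(A[k][k]); /* factor the diagonal tile */

        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            trsm_tile(A[k][k], A[i][k]); /* solve the panel tile */
        }

        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            syrk_tile(A[i][k], A[i][i]); /* update the diagonal tile */

            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                gemm_tile(A[i][k], A[j][k], A[i][j]); /* update tile (i,j) */
            }
        }
    }
}

Each of the four task types corresponds to one of the profiled kernels, so per-kernel best-case and worst-case timings translate directly into bounds on the execution time of the whole task graph.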